[57] ABSTRACT

A system and method for thermal overload detection and
protection for a processor which allows the processor to run
at near maximum potential for the vast majority of its
execution life. This is effectuated by the provision of
circuitry to detect when the processor has exceeded its thermal
thresholds and which then causes the processor to
automatically reduce the clock rate to a fraction of the nominal
clock while execution continues. When the thermal condition has
stabilized, the clock may be raised in a stepwise fashion
back to the nominal clock rate. Throughout the period of
cycling the clock frequency from nominal to minimum and
back, the program continues to be executed. Also provided
is a queue activity rise time detector and method to control
the rate of acceleration of a functional unit from idle to full
throttle by a localized stall mechanism at the boundary of
each stage in the pipe. This mechanism can detect when an
idle queue is suddenly overwhelmed with input such that
over a short period of approximately 10-20 machine cycles,
the queue activity rate has increased from idle to near stall
threshold.

6 Claims, 6 Drawing Sheets 
[Drawing labels: INSTRUCTION SCHEDULING WINDOW surrounded by its
pickers: FP/GR MUL PICKER LO (128 TO 2); BRANCH PICKER LO and HI
(128 TO 1); MEM PICKER LO and HI (128 TO 1); YOUNGER SQUASHER
(128 TO 1); OLDEST PICKER (128 TO 1); INTEGER PICKER HI/EVEN,
LO/EVEN, LO/ODD and HI/ODD (64 TO 1); INT. SPECIAL PIPE PICKER LO
(128 TO 1); FP SPECIAL PIPE PICKER (128 TO 1); FP/GR ADD PICKER HI
(128 TO 1)]
METHOD FOR THERMAL OVERLOAD
DETECTION AND PREVENTION FOR AN 
INTEGRATED CIRCUIT PROCESSOR

CROSS-REFERENCES TO RELATED
APPLICATIONS

The subject matter of the present application is related to
that of co-pending U.S. patent application Ser. No. 08/881,958
for AN APPARATUS FOR HANDLING ALIASED FLOATING-POINT
REGISTERS IN AN OUT-OF-ORDER PROCESSOR filed concurrently
herewith by Ramesh Panwar; Ser. No. 08/881,729 for APPARATUS
FOR PRECISE ARCHITECTURAL UPDATE IN AN OUT-OF-ORDER
PROCESSOR filed concurrently herewith by Ramesh Panwar and
Aijun Prabhu; Ser. No. 08/881,726 for AN APPARATUS FOR
NON-INTRUSIVE CACHE FILLS AND HANDLING OF LOAD MISSES filed
concurrently herewith by Ramesh Panwar and Ricky C.
Hetherington; Ser. No. 08/881,908 for AN APPARATUS FOR
HANDLING COMPLEX INSTRUCTIONS IN AN OUT-OF-ORDER PROCESSOR
filed concurrently herewith by Ramesh Panwar and Dani Y.
Dakhil; Ser. No. 08/882,173 for AN APPARATUS FOR ENFORCING
TRUE DEPENDENCIES IN AN OUT-OF-ORDER PROCESSOR filed
concurrently herewith by Ramesh Panwar and Dani Y. Dakhil;
Ser. No. 08/881,145 for APPARATUS FOR DYNAMICALLY
RECONFIGURING A PROCESSOR filed concurrently herewith by
Ramesh Panwar and Ricky C. Hetherington; Ser. No. 08/881,732
for APPARATUS FOR ENSURING FAIRNESS OF SHARED EXECUTION
RESOURCES AMONGST MULTIPLE PROCESSES EXECUTING ON A SINGLE
PROCESSOR filed concurrently herewith by Ramesh Panwar and
Joseph I. Chamdani; Ser. No. 08/882,175 for SYSTEM FOR
EFFICIENT IMPLEMENTATION OF MULTI-PORTED LOGIC FIFO
STRUCTURES IN A PROCESSOR filed concurrently herewith by
Ramesh Panwar; Ser. No. 08/882^11 for AN APPARATUS FOR
MAINTAINING PROGRAM CORRECTNESS WHILE ALLOWING LOADS TO BE
BOOSTED PAST STORES IN AN OUT-OF-ORDER MACHINE filed
concurrently herewith by Ramesh Panwar, P. K. Chidambaran and
Ricky C. Hetherington; Ser. No. 08/881,731 for APPARATUS FOR
TRACKING PIPELINE RESOURCES IN A SUPERSCALAR PROCESSOR filed
concurrently herewith by Ramesh Panwar; Ser. No. 08/882,525
for AN APPARATUS FOR RESTRAINING OVER-EAGER LOAD BOOSTING IN
AN OUT-OF-ORDER MACHINE filed concurrently herewith by Ramesh
Panwar and Ricky C. Hetherington; Ser. No. 08/882,220 for AN
APPARATUS FOR HANDLING REGISTER WINDOWS IN AN OUT-OF-ORDER
PROCESSOR filed concurrently herewith by Ramesh Panwar and
Dani Y. Dakhil; Ser. No. 08/881,847 for AN APPARATUS FOR
DELIVERING PRECISE TRAPS AND INTERRUPTS IN AN OUT-OF-ORDER
PROCESSOR filed concurrently herewith by Ramesh Panwar; Ser.
No. 08/881,728 for NON-BLOCKING HIERARCHICAL CACHE THROTTLE
filed concurrently herewith by Ricky C. Hetherington and
Thomas M. Wicki; Ser. No. 08/881,727 for NON-THRASHABLE
NON-BLOCKING HIERARCHICAL CACHE filed concurrently herewith by
Ricky C. Hetherington, Sharad Mehrotra and Ramesh Panwar; Ser.
No. 08/881,065 for IN-LINE BANK CONFLICT DETECTION AND
RESOLUTION IN A MULTI-PORTED NON-BLOCKING CACHE filed
concurrently herewith by Ricky C. Hetherington, Sharad
Mehrotra and Ramesh Panwar; and Ser. No. 08/882,613 for SYSTEM
FOR THERMAL OVERLOAD DETECTION AND PREVENTION FOR AN
INTEGRATED CIRCUIT PROCESSOR filed concurrently herewith by
Ricky C. Hetherington and Ramesh Panwar, the disclosures of
which applications are herein incorporated by this reference.

BACKGROUND OF THE INVENTION 

The present invention relates, in general, to the field of 
integrated circuit ("IC") devices. More particularly, the 
present invention relates to a system and method for thermal 
overload detection and prevention for a processor or other 
high speed, high density integrated circuit devices. 

Early computer processors (also called microprocessors) 
included a central processing unit or instruction execution 
unit that executed only one instruction at a time. As used 
herein the term processor includes complete instruction set 
computers ("CISC"), reduced instruction set computers 
("RISC") and hybrids. In response to the need for improved 
performance several techniques have been used to extend 
the capabilities of these early processors including 
pipelining, superpipelining, superscaling, speculative 
instruction execution, and out-of-order instruction execu- 
tion. 

Pipelined architectures break the execution of instructions 
into a number of stages where each stage corresponds to one 
step in the execution of the instruction. Pipelined designs 
increase the rate at which instructions can be executed by 
allowing a new instruction to begin execution before a 
previous instruction is finished executing. Pipelined archi- 
tectures have been extended to "superpipelined" or 
"extended pipeline" architectures where each execution 
pipeline is broken down into even smaller stages (i.e., 
microinstruction granularity is increased). Superpipelining 
increases the number of instructions that can be executed in 
the pipeline at any given time. 
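
As a rough illustration of why overlapping instructions raises
throughput, the following Python sketch counts cycles for an
idealized pipeline. The function names, the one-cycle-per-stage
timing and the absence of stalls are assumptions made for this
illustration and are not taken from the present disclosure.

```python
def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Idealized pipeline: one cycle per stage, no stalls or hazards.

    The first instruction needs num_stages cycles to complete; every
    later instruction completes one cycle after its predecessor.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)
```

Under these assumptions, 100 instructions on a 5 stage design need
500 cycles without overlap and 104 cycles with it. Real pipelines
fall short of this bound because of the dependencies and branches
discussed in the following paragraphs.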

"Superscalar" processors generally refer to a class of 
microprocessor architectures that include multiple pipelines 
that process instructions in parallel. Superscalar processors 
typically execute more than one instruction per clock cycle, 
on average. Superscalar processors allow parallel instruction 
execution in two or more instruction execution pipelines. 
The number of instructions that may be processed is 
increased due to parallel execution. Each of the execution 
pipelines may have a differing number of stages. Some of the
pipelines may be optimized for specialized functions such as 
integer operations or floating point operations, and in some 
cases execution pipelines are optimized for processing 
graphic, multimedia, or complex math instructions. 

The goal of superscalar and superpipeline processors is to
execute multiple instructions per cycle ("IPC").
Instruction-level parallelism ("ILP") available in programs can
be exploited to realize this goal; however, this potential
parallelism requires that instructions be dispatched for
execution at a sufficient rate. Conditional branching
instructions create a problem for instruction fetching because
the instruction fetch unit ("IFU") cannot know with certainty
which instructions to fetch until the conditional branch
instruction is resolved. Also, when a branch is detected, the
target address of the instructions following the branch must be
predicted to supply those instructions for execution.

Recent processor architectures use a branch prediction 
unit to predict the outcome of branch instructions allowing 
the fetch unit to fetch subsequent instructions according to 
the predicted outcome. Branch prediction techniques are 
known that can predict branch outcomes with greater than 
95% accuracy. These instructions are "speculatively
executed" to allow the processor to make forward progress
during the time the branch instruction is resolved. When the
prediction is correct, the results of the speculative execution
can be used as correct results, greatly improving processor
speed and efficiency. When the prediction is incorrect, the
completely or partially executed instructions must be flushed
from the processor and execution of the correct branch
initiated.

Early processors executed instructions in an order
determined by the compiled machine-language program running
on the processor and so are referred to as "in-order" or
"sequential" processors. In superscalar processors multiple
pipelines can simultaneously process instructions only when
there are no data dependencies between the instructions in
each pipeline. Data dependencies cause one or more pipelines
to "stall" waiting for the dependent data to become
available. This is further complicated in superpipelined
processors where, because many instructions are
simultaneously in each pipeline, the potential quantity of data
dependencies is large. Hence, greater parallelism and higher
performance are achieved by out-of-order processors that
include multiple pipelines in which instructions are
processed in parallel in any efficient order that takes advantage
of opportunities for parallel processing that may be provided
by the instruction code.

In any event, processors capable of providing this
parallelism, and operating at very high frequencies, require
millions of densely integrated transistors. Concomitantly
however, high density devices operating at very high clock
speeds can result in potentially damaging heat generation
even at relatively low operating voltages. Conventional
processors, which operate at what are today considered to be
high frequencies with transistor counts in the 10s of
millions, are generally designed to continually operate
within worst case constraints of thermal and transient power
conditions. These constraints place an upper bound on the
performance of the processor which can actually be much
lower than the peak performance of which the device is
capable. Statistically however, not all critical circuits are at
their maximum active levels even when the chip is at its
peak processing speed but current analysis models assume
the worst combination.

Currently there are processors which can step down their
internal clock until they achieve a minimum power
consumption level. This power down state is entered due to the
automatic detection of idle activity and the chip is powered
back to nominal clock levels upon receipt of a non-masked
interrupt. Some other implementations suspend execution
while others continue to execute instructions while the clock
frequency is being modified.

SUMMARY OF THE INVENTION

The system and method for thermal overload detection
and protection for an integrated circuit processor of the
present invention allows the processor to run at near
maximum potential for the vast majority of its execution life. This
is effectuated by the provision of circuitry to detect when the
processor has exceeded its thermal thresholds and which
then causes the processor to automatically reduce the clock
rate to a fraction of the nominal clock while execution
continues. When the thermal condition has stabilized, the
clock may be raised in a stepwise fashion back to the
nominal clock rate. Throughout the period of cycling the
clock frequency from nominal to minimum and back, the
program continues to be executed.

The system and method of the present invention is of
particular utility in allowing a given processor to position its
nominal clock rate at a higher frequency than traditional
worst case design rules would otherwise permit.
Nevertheless, with this elevated nominal clock, there are
possible conditions in which the processor might experience
thermal and transient power conditions that would threaten
the short and long term reliability of the processor. In
this latter regard, disclosed herein are two mechanisms
which can selectively throttle the execution rates of the
[illegible] capacity in an extremely [illegible] concomitant
current demand in [illegible] capability of the local power
supply. The [illegible] this excessive current [illegible]
threatens noise margins and reduces the designed to
integrity of the clock.

The second mechanism addresses a condition which can
threaten the long term reliability of the processor such as
when it operates at its full potential for relatively long
periods resulting in an increase in the temperature of the
integrated circuit die beyond acceptable levels. Actual
physical damage to the silicon can result from operating at
[illegible] capability of the package (inclusive of the
die and physical support apparatus) to remove heat.

In the first instance, short term transient conditions that
cause problems with the supply voltage are designed to be
[illegible] every major functional unit of the processor.
These "governors" detect and control the rate of acceleration
of a functional unit from idle to full throttle by a localized
stall mechanism at the boundary of each stage in the pipe. As
[illegible] windows in the design can detect when
[illegible] queue throughput and [illegible] condition back to
each respective queue [illegible] responsible for filling the
queue. This mechanism can also detect when an idle queue is
suddenly overwhelmed with input such that over a short
period of approximately 10-20 machine cycles, the queue
activity rate has increased from idle to near stall threshold.
This mechanism functions not to limit the maximum
[illegible] processor but rather to control the rise
[illegible] functional areas from at or near [illegible].

The second aspect of the present invention addresses long
term device reliability by guarding against excess thermal
conditions that could cause harm to the die. A thermal
sensing circuit and method is provided herein that
incorporates a programmable threshold which, when reached,
causes the circuit to generate a non-masked interrupt to the
processor which, in an exemplary embodiment, may be
identical to a power down "Energy Star" interrupt. The
internal phase-locked loop ("PLL") clock dividers may be
[illegible] nominal to, for example, 1/64 of the nominal rate.
Program execution would then continue at this lowered or
reduced clock rate until the thermal sensing circuit again
senses that a temperature threshold has been crossed,
whereupon it may again issue a non-masked interrupt to raise
the clock back to nominal frequency. As before, normal
program execution commences at the conclusion of the interrupt.

Particularly disclosed herein is a processor including a
plurality of instruction processing units having instruction
queues therebetween. The processor comprises a queue
activity detector for monitoring at least one of the instruction
queues and having a predetermined activity rise time level
threshold therefor. The queue activity detector asserts a stall
signal to an activity source on the instruction queue when the
threshold is exceeded and de-asserts the stall signal when the
threshold is no longer exceeded.
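
The behavior of a queue activity detector of this kind can be
sketched in Python as follows. This is only a minimal model under
stated assumptions: queue occupancy stands in for "activity", the
rise is measured against the lowest occupancy seen in a sliding
window of 16 machine cycles (within the 10-20 cycle range mentioned
above), and the class name, function name and numeric thresholds are
hypothetical rather than taken from the disclosure.

```python
from collections import deque


class QueueRiseTimeDetector:
    """Hypothetical model of a queue activity rise time detector."""

    def __init__(self, window_cycles: int = 16, rise_threshold: int = 12):
        # window_cycles: length of the observation window in machine cycles
        # rise_threshold: largest occupancy increase tolerated within a window
        self.rise_threshold = rise_threshold
        self.history = deque(maxlen=window_cycles)

    def step(self, occupancy: int) -> bool:
        """Record this cycle's occupancy; return True while stall is asserted."""
        self.history.append(occupancy)
        rise = occupancy - min(self.history)
        return rise > self.rise_threshold


def simulate(cycles: int, fill_per_cycle: int, capacity: int, detector) -> list:
    """Fill an idle queue each cycle unless the detector asserts stall.

    Draining is omitted for simplicity, so the trace shows only how
    quickly the activity source is allowed to ramp up.
    """
    occupancy, stalled, trace = 0, False, []
    for _ in range(cycles):
        if not stalled:
            occupancy = min(capacity, occupancy + fill_per_cycle)
        stalled = detector.step(occupancy)
        trace.append(occupancy)
    return trace
```

In this model the stall is de-asserted on its own once the window
slides past the idle samples, so the source is paced during the ramp
and is not capped afterwards, in keeping with the stated aim of
controlling rise time and not maximum throughput.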
Also disclosed herein is a method for moderating current
demand in an integrated circuit processor comprising a
plurality of instruction processing units having instruction
queues therebetween. The method comprises the steps of
establishing a predetermined instruction activity rise time
threshold for at least one of the instruction queues and
monitoring an instruction activity level on the at least one of
the instruction queues. The method further comprises the
steps of asserting a stall signal to an activity source on the
at least one of the instruction queues when the threshold is
exceeded and de-asserting the stall signal when the threshold
is no longer exceeded.

Further disclosed herein is an integrated circuit processor
including a plurality of instruction processing units having
instruction queues therebetween. The processor comprises a
thermal sensing circuit in thermal contact with the processor.
The thermal sensing circuit has a predetermined thermal
threshold thereof and produces a first output signal on an
output thereof when the predetermined thermal threshold is
crossed in a first direction. A clock circuit provides a
clocking signal to the processor and is coupled to the output
of the thermal sensing circuit. The clock circuit is operative
to reduce a frequency of the clocking signal to a reduced
frequency thereof when the first output signal is received. In
a more particular embodiment, the thermal sensing circuit
has an additional predetermined thermal threshold and
produces a second output signal on the output when the
additional predetermined thermal threshold is crossed in a
second opposite direction to cause the clock circuit to increase
the frequency of the clock signal back towards a nominal
frequency.

Still further disclosed herein is a method for controlling
an operating temperature of an integrated circuit processor
which comprises the steps of establishing a predetermined
thermal threshold for the processor and reducing a frequency
of a clocking signal to the processor from a nominal
frequency thereof to a reduced frequency thereof in response to
the predetermined thermal threshold being exceeded.

BRIEF DESCRIPTION OF THE DRAWINGS

The aforementioned and other features and objects of the
present invention and the manner of attaining them will
become more apparent and the invention itself will be best
understood by reference to the following description of an
exemplary embodiment taken in conjunction with the
accompanying drawings, wherein:

FIG. 1 is a functional block diagram of a computer system
incorporating an apparatus, system and method in
accordance with the present invention;

FIG. 2 is a functional block diagram of an integrated
circuit implementation of a processor incorporating a
thermal sensing circuit, for example, integrated as a portion of
the processor die, in accordance with the present invention;

FIG. 3 is a more detailed functional block diagram of an
exemplary instruction unit, for example, the instruction
scheduling unit ("ISU"), forming a portion of the processor
of the preceding figures which incorporates a queue activity
rise time detector in accordance with the present invention;

FIG. 4 is a representative overview of the instruction
scheduling window ("ISW") and pickers forming a portion
of the ISU of FIG. 3;

FIG. 5 is a representative logic flowchart illustrating the
possible functionality of a thermal sensing circuit in
accordance with the present invention as shown in FIG. 2; and

FIG. 6 is a further representative logic flowchart
illustrating the possible functionality of a queue activity and rise
time detector in accordance with the present invention as
shown in FIG. 3.

DESCRIPTION OF AN EXEMPLARY
EMBODIMENT

Processor architectures can be represented as a collection
of interacting functional units as shown in FIG. 1. These
functional units, discussed in greater detail below, perform
the functions of fetching instructions and data from memory,
preprocessing fetched instructions, scheduling instructions
to be executed, executing the instructions, managing
memory transactions, and interfacing with external circuitry
and devices.

The present invention is described in terms of apparatus
and methods particularly useful in a superpipelined and
superscalar processor 102 shown in block diagram form in
FIG. 1 and FIG. 2. The particular examples represent
implementations useful in high clock frequency operation
[illegible] executing multiple instructions
per cycle ("IPC"). However, it is expressly understood that
the inventive features of the present invention may be
usefully embodied in a number of alternative processor
architectures that will benefit from the performance features
of the present invention. Accordingly, these alternative
embodiments are equivalent to the particular embodiments
shown and described herein.

FIG. 1 shows a typical general purpose computer system
100 incorporating a processor 102 in accordance with the
present invention. Computer system 100 in accordance with
the present invention comprises an address/data bus 101 for
communicating information, processor 102 coupled with
bus 101 through input/output ("I/O") device 103 for
processing data and executing instructions, and memory system
[illegible] information and instructions for processor 102.
Memory system 104 comprises, for example, cache memory 105
and main memory 107. Cache memory 105 includes one or more
levels of cache memory. In a typical embodiment, processor
102, I/O device 103, and some or all of cache memory 105
may be integrated in a single integrated circuit, although the
specific components and integration density are a matter of
design choice selected to meet the needs of a particular
[illegible].

[illegible] are coupled to bus 101 and are
operative to communicate information in appropriately
[illegible]. User I/O devices may include a keyboard, mouse,
card [illegible] computer.
Mass storage device 117 is coupled to bus 101 and may be
implemented using one or more magnetic hard disks,
magnetic tapes, CDROMs, large banks of random access
memory, or the like. A wide variety of random access and
read only memory technologies are available and are
equivalent for purposes of the present invention. Mass storage 117
may include computer programs and data stored therein.
Some or all of mass storage 117 may be configured to be
incorporated as a part of memory system 104.

In a typical computer system 100, processor 102, I/O
device 103, memory system 104, and mass storage device
117 are coupled to bus 101 formed on a printed circuit board
and integrated into a single housing as suggested by the
dashed-line box 108. However, the particular components
chosen to be integrated into a single housing are based upon
market and design choices. Accordingly, it is expressly
understood that fewer or more devices may be incorporated
within the housing suggested by dashed line 108.
Display device 109 is used to display messages, data, a
graphical or command line user interface, or other
communications with the user. Display device 109 may be
implemented, for example, by a cathode ray tube (CRT)
monitor, liquid crystal display (LCD) or any available
equivalent.

FIG. 2 illustrates the principal components of a
monolithically integrated circuit processor 200 implementation of
the processor 102 in greater detail in block diagram form. It
is contemplated that processor 102 may be implemented
with more or fewer functional components and still benefit
from the apparatus and methods of the present invention
unless expressly specified herein. Also, functional units are
identified using a precise nomenclature for ease of
description and understanding, but other nomenclature is
often used to identify equivalent functional units.

Instruction fetch unit ("IFU") 202 comprises instruction
fetch mechanisms and includes, among other things, an
instruction cache for storing instructions, branch prediction
logic, and address logic for addressing selected instructions
in the instruction cache. The instruction cache is commonly
referred to as a portion ("I$") of the level one ("L1") cache
with another portion ("D$") of the L1 cache dedicated to
data storage. IFU 202 fetches one or more instructions at a
time by appropriately addressing the instruction cache. The
instruction cache feeds addressed instructions to instruction
rename unit ("IRU") 204. Preferably, IFU 202 fetches
multiple instructions each cycle and in a specific example
fetches eight instructions each cycle.

In the absence of a conditional branch instruction, IFU 202
addresses the instruction cache sequentially. The branch
prediction logic in IFU 202 handles branch instructions,
including unconditional branches. An outcome tree of each
branch instruction is predicted using any of a variety of
available branch prediction algorithms and mechanisms.
More than one branch can be predicted simultaneously by
supplying sufficient branch prediction resources. After the
branches are predicted, the predicted address is applied to
the instruction cache rather than the next sequential address.

IRU 204 comprises one or more pipeline stages that
include instruction renaming and dependency checking
mechanisms. The instruction renaming mechanism is
operative to map register specifiers in the instructions to physical
register locations and to perform register renaming to
prevent false dependencies. IRU 204 further comprises
dependency checking mechanisms that analyze the instructions to
determine if the operands (identified by the instructions'
register specifiers) cannot be determined until another "live
instruction" has completed. The term "live instruction" as
used herein refers to any instruction that has been issued to
an execution pipeline but has not yet completed or been
retired. IRU 204 outputs renamed instructions to instruction
scheduling unit (ISU) 206.

Program code may contain complex instructions, also
called "macroinstructions", from the running object code. It
is desirable in many applications to break these complex
instructions into a plurality of simple instructions or
"microinstructions" to simplify and expedite execution. In a
specific implementation, the execution units are optimized to
precisely handle instructions with a limited number of
dependencies using a limited number of resources (i.e.,
registers). Complex instructions include any instructions
that require more than the limited number of resources or
involve more than the limited number of dependencies. IRU
204 includes mechanisms to translate or explode complex
instructions into a plurality of microinstructions. These
microinstructions are executed more efficiently in the
execution units (e.g., floating point and graphics execution unit
("FGU") 210 and integer execution unit ("IEU") 208) than
could the macroinstructions.

ISU 206 receives renamed instructions from IRU 204 and
registers them for execution. Upon registration, instructions
are deemed "live instructions" in the specific example. ISU
206 is operative to schedule and dispatch instructions as
soon as their dependencies have been satisfied into an
appropriate execution unit (e.g., integer execution unit (IEU)
208 or floating point and graphics unit (FGU) 210). ISU 206
also maintains trap status of live instructions. ISU 206 may
perform other functions such as maintaining the correct
architectural state of processor 102, including state
maintenance when out-of-order instruction processing is used. ISU
206 may include mechanisms to redirect execution
appropriately when traps or interrupts occur and to ensure efficient
execution of multiple threads where multiple threaded
operation is used. Multiple thread operation means that
processor 102 is running multiple substantially independent
processes simultaneously. Multiple thread operation is
consistent with but not required by the present invention.

ISU 206 also operates to retire executed instructions when
completed by IEU 208 and FGU 210. ISU 206 performs the
appropriate updates to register files and control registers
upon complete execution of an instruction. ISU 206 is
responsive to exception conditions and discards operations
being performed on instructions subsequent to an instruction
generating an exception in the program order. ISU 206
[illegible] instructions from a mispredicted branch and
initiates IFU 202 to fetch from the correct branch. An
instruction is retired when it has finished execution and all
instructions from which it depends have completed. Upon
retirement the instruction's result is written into the
appropriate register file and is no longer deemed a "live
instruction".

IEU 208 includes one or more pipelines, each comprising
one or more stages that implement integer instructions. IEU
208 also includes mechanisms for holding the results and
state of speculatively executed integer instructions. IEU 208
functions to perform final decoding of integer instructions
before they are executed on the execution units and to
determine operand bypassing amongst instructions in an
out-of-order processor. IEU 208 executes all integer
instructions including determining correct virtual addresses for
load/store instructions. IEU 208 also maintains correct
architectural register state for a plurality of integer registers
in processor 102. IEU 208 preferably includes mechanisms
to access single and/or double precision architectural
registers as well as single and/or double precision rename
registers.

FGU 210 includes one or more pipelines, each comprising
one or more stages that implement floating point
instructions. FGU 210 also includes mechanisms for holding the
results and state of speculatively executed floating point and
graphic instructions. FGU 210 functions to perform final
decoding of floating point instructions before they are
executed on the execution units and to determine operand
bypassing amongst instructions in an out-of-order processor.
In the specific example, FGU 210 includes one or more
pipelines dedicated to implement special purpose
multimedia and graphic instructions that are extensions to standard
architectural instructions for a processor. FGU 210 may be
equivalently substituted with a floating point unit (FPU) in
designs in which special purpose graphic and multimedia
instructions are not used. FGU 210 preferably includes
mechanisms to access single and/or double precision
architectural registers as well as single and/or double precision
rename registers.

A data cache memory unit ("DCU") 212 including cache 
memory 105 shown in FIG. 1 functions to cache memory 
reads from off-chip memory through external interface unit
("EIU") 214. Optionally, DCU 212 also caches memory 
write transactions. DCU 212 comprises one or more hier- 
archical levels of cache memory and the associated logic to 
control the cache memory. One or more of the cache levels 
within DCU 212 may be read only memory to eliminate the
logic associated with cache writes. 

The integrated circuit processor 200 shown also comprises
a thermal sensing circuit 220 operatively coupled to
the processor 200 clock in order to implement one aspect of 
the system and method of the present invention as will be 
more fully described hereinafter with respect to FIG. 5. 

With reference additionally now to FIG. 3, an exemplary 
instruction unit of the processor 102 is shown for purposes 
of illustrating yet another aspect of the system and method 
of the present invention. The instruction unit illustrated is 
the ISU 206 of FIG. 2, which is responsible for a number of
the processor 102 functions including: 

scheduling the instructions for execution as soon as their 
dependencies have been satisfied; 

maintaining the trap status of all the instructions; 

maintaining the architectural program counter and other 
special registers; 

maintaining the correct architectural state of the machine 
in spite of executing instructions out-of-order;

redirecting the execution appropriately for traps and inter- 
rupts; and 

ensuring fairness amongst the various threads by appropriately
throttling thread fetching and thread scheduling when the
machine is operating in multithreaded mode.

The ISU 206 contains several structures which are shown in detail
in FIG. 3. The IRU 204 (FIG. 2) sends 8 instruction bundles
containing the instructions, the dependency information regarding
the instructions (encoded through producer identifications "PIDs"),
and the trap status of the instructions to the ISU 206. The
dependency status along with the instruction ready status and other
information such as the expected latency of the instruction is
stored in the instruction scheduling window ("ISW") 300 while the
instruction itself is stored in the instruction wait buffer ("IWB")
302. The ISW 300 is folded for reasons of timing as will be more
fully described hereinafter and has instruction pickers residing on
both sides. The instruction pickers pick the instructions that are
ready for execution by generating the appropriate word lines for
the IWB 302 so that the instructions can be read out to be issued
to the execution units. The instruction as well as the
identification of the instruction is sent out to the execution
units so that the execution units can respond back with the trap
and completion status. When the trap and completion status of an
instruction arrives from the execution units, it is written into
the instruction retirement window ("IRW") 304. The retirement logic
looks at contiguous entries in the IRW 304 and retires them in
order to ensure proper update of architectural state.

In the embodiment illustrated and described, the processor 102 can
have 128 instructions alive at any given time and there is a 1-to-1
correspondence between the entries in all the structures in the ISU
206 and the dependency checking table ("DCT") which resides in the
IRU 204. The ISW 300, the IWB 302, and the IRW 304 contain 128
entries with each entry corresponding to one of the live
instructions in the processor 102. Since helpers are generated in
the first stage of register renaming, there may be cases where
multiple entries correspond to a single complex instruction. In the
case where multiple entries correspond to a single instruction, the
entries are all contiguous and the last entry will be marked as an
instruction boundary to facilitate instruction retirement.
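
The in-order retirement of contiguous entries, including the case where several helper entries belong to one complex instruction, can be sketched in software as follows. This is only an illustrative model under stated assumptions: the `IrwEntry` fields, the `retire_in_order` function and the `head` index are hypothetical names, and the window is assumed to be managed as a circular buffer whose oldest entry sits at `head`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class IrwEntry:
    """One retirement-window entry (field names are illustrative)."""
    valid: bool = False
    completed: bool = False   # completion status reported by an execution unit
    boundary: bool = False    # marks the last entry of an instruction


def retire_in_order(irw: List[IrwEntry], head: int) -> Tuple[int, int]:
    """Retire whole instructions starting at the oldest entry.

    Returns the new head index and the number of instructions retired.
    An instruction retires only when every one of its contiguous
    entries, up to and including the boundary entry, has completed.
    """
    size = len(irw)
    retired = 0
    while True:
        group = []
        index = head
        found_boundary = False
        for _ in range(size):
            entry = irw[index]
            if not (entry.valid and entry.completed):
                break
            group.append(index)
            index = (index + 1) % size
            if entry.boundary:
                found_boundary = True
                break
        if not found_boundary:
            return head, retired
        for i in group:
            irw[i] = IrwEntry()   # free the entries of the retired instruction
        head = index
        retired += 1
```

In this sketch a younger instruction that has already completed is left in place until every older instruction has retired, which is the property the retirement logic relies on to update architectural state correctly.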

With reference additionally now to FIG. 4, the ISW 300 
is illustrated in greater detail. Each entry contains 3 PID 
fields, which are indices of older instructions upon whose 
results the current instruction is dependent. While there are 
dependencies that have not been satisfied (older instructions 
have not yet completed to produce the results), the entry 
remains not ready. Once all the dependencies have cleared, 
the entry becomes ready. The pickers monitor the readiness 
state of all the entries and choose the oldest ready instruc-
tions for issue. Instruction entries that are issued broadcast
that information so that younger dependent instruction 
entries can become ready and be issued. 
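
As a rough software analogy of a "pick oldest ready" structure, the sketch below scans the window starting from the oldest live entry and returns the first entry whose ready bit is set. The function name, the `oldest_index` parameter and the list-of-booleans representation are assumptions made for illustration, as is the treatment of the window as a circular queue; the hardware evaluates all entries in parallel rather than scanning them.

```python
from typing import Optional, Sequence


def pick_oldest_ready(ready: Sequence[bool], oldest_index: int) -> Optional[int]:
    """Return the index of the oldest ready entry, or None if none is ready.

    Age order is assumed to run from oldest_index upward, wrapping
    around the end of the window.
    """
    size = len(ready)
    for offset in range(size):
        index = (oldest_index + offset) % size
        if ready[index]:
            return index
    return None
```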

Instruction entries pass through 5 states while in the 
scheduling window: 

Initialized: this is generally the state of new entries as they 
are written into the scheduling window (though they can be 
in the "Ready" state if all source operands are available). 
Entries in this state are waiting for older instructions to
generate needed results. The valid bit is set and other control 
state (such as the latency counter, instruction type, etc.) is 
initialized. 

Ready: entries move to this state when all 3 potential 
dependencies (indicated by PIDs) are satisfied. One of the 
entry's 8 ready bits goes active. 

Issued: transition to this state occurs when the picker 
associated with the active ready bit sends back a signal 
announcing the entry has been picked for issue. The latency 
counter (described in more detail hereinafter) begins to 
count down. 

Completed: entries move into this state once they have 
been issued and the latency counter = 0 (results would be
available to instructions issued in the next cycle). 

Retired (Invalid): entries move into this state once they 
have been retired (results written from result buffer to 
register file) or when a flush happens that invalidates the 
entry (bad prediction, trap, etc.). The valid bit is cleared. 
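
The five states listed above can be modeled as a small state machine. The sketch below is illustrative only: the class and method names are hypothetical, dependencies are reduced to a simple counter, and the exact cycle in which the latency counter is decremented is an assumption rather than something the description specifies.

```python
from enum import Enum


class EntryState(Enum):
    INITIALIZED = "initialized"
    READY = "ready"
    ISSUED = "issued"
    COMPLETED = "completed"
    RETIRED = "retired"


class WindowEntry:
    """Simplified scheduling-window entry (names are illustrative)."""

    def __init__(self, pending_dependencies: int, latency: int):
        self.pending_dependencies = pending_dependencies
        self.latency_counter = latency
        self.valid = True
        # A new entry may start out Ready if all source operands are available.
        if pending_dependencies == 0:
            self.state = EntryState.READY
        else:
            self.state = EntryState.INITIALIZED

    def dependency_cleared(self) -> None:
        """An older producer has broadcast completion."""
        if self.state is EntryState.INITIALIZED:
            self.pending_dependencies -= 1
            if self.pending_dependencies == 0:
                self.state = EntryState.READY

    def picked(self) -> None:
        """The picker signals that this entry was chosen for issue."""
        if self.state is EntryState.READY:
            self.state = EntryState.ISSUED

    def tick(self) -> None:
        """Advance one cycle; the latency counter runs only after issue."""
        if self.state is EntryState.ISSUED:
            self.latency_counter -= 1
            if self.latency_counter <= 0:
                self.state = EntryState.COMPLETED

    def retire_or_flush(self) -> None:
        """Retirement or a flush clears the valid bit."""
        self.valid = False
        self.state = EntryState.RETIRED
```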

Each entry activates 1 of its several ready bits once it 
determines that all its dependencies have cleared. These 
ready bits feed a set of "picker" structures that apply an
algorithm for selecting which of the ready instructions to 
dispatch in that cycle. This information is passed back into 
the entries and combined with information about the latency 
of the instructions (maintained within the entry) to deter- 
mine when to broadcast completion status. Each entry in the 
window broadcasts its completion state (i.e., has been issued 
and its results available) on a wire (called "sdisp") that spans 
the ISW 300. Every entry also has three 128-bit wide 
multiplexers that select, via decode of the 3 PID fields, the 
correct completion signal to watch. An entry is ready for 
issue when the logical AND of all three multiplexer outputs 
is active. 
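
The slow ready computation just described amounts to three multiplexer selections followed by an AND. A minimal sketch, assuming the "sdisp" wires are represented as a list of 128 booleans and that an unused PID field is encoded as `None` and treated as already satisfied (an assumption; the description does not say how unused fields are encoded):

```python
from typing import Optional, Sequence

WINDOW_SIZE = 128  # number of entries, and of completion wires, in the window


def slow_ready(pids: Sequence[Optional[int]], sdisp: Sequence[bool]) -> bool:
    """Logical AND of the completion signals selected by the 3 PID fields.

    Each PID acts as the select input of a 128-to-1 multiplexer over
    the completion wires.
    """
    return all(pid is None or sdisp[pid] for pid in pids)
```

For example, an entry with PIDs (3, 40, 41) becomes ready only once entries 3, 40 and 41 have all broadcast completion.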

Because of the length of the wires involved and the 
loading on each of them (3 * 128 multiplexer inputs), it takes 
a complete cycle to complete this process. Therefore, it is 
not possible to pick a single cycle latency instruction for 
issue and communicate the completion information to 
dependent instructions all within 1 cycle. This requires a 
means of handling the latency sensitive applications involv- 
ing dependency chains of single-cycle latency operations. 
The fast ready mechanism provides this capability. 
The fast ready mechanism is similar to the slow ready one, except
that its scope is limited to the nearest 8 younger instructions to
a completing instruction. A separate set of completion wires
("fdisp") and multiplexers is used. Each entry drives a completion
signal to its 8 neighboring younger entries, each of which has 3
8-to-1 multiplexers selecting based on the 3 PID fields. Ready then
becomes an OR-AND function: each fast ready multiplexer output is
ORed together with the corresponding slow ready multiplexer output,
and those results are ANDed together.

As currently contemplated, only the integer execution pipes will
support single cycle latency instructions, so the fast ready
mechanism need only incorporate issue information from the integer
pickers. All other pipelines can have a minimum 2 cycle latency,
which means the slow ready mechanism is adequate for communicating
completion information. However, due to anticipated bypassing
restrictions in the integer datapath, the fast ready mechanism may
need to incorporate information from the memory pipes as well.

The integer datapath is broken into 2 symmetric banks of pipelines,
each containing 2 integer pipes, 1 memory pipe, and 1 branch pipe.
Bypassing of results with 0 latency (used the cycle following the
completion of the producing instruction) is only supported among
pipes within each bank. Bypassing of results between banks (result
produced in one bank, used in the other) requires 1 additional
cycle of latency due to physical design constraints.

Effectively, this means that the availability of results for use by
dependent instructions has variable latency, depending on whether
the producer and the consumer of the data go to the same bank or to
different banks. Without some mechanism to steer dependent
instructions to the same execution bank as the producers, all
instructions going to the integer datapath would require an
additional cycle of latency to allow for full bypassing.

Instruction coloring preassigns entries in the ISW 300 to one of
the two execution banks (arbitrarily named "hi" and "lo") in the
integer pipeline and the means for achieving this is discussed in
more detail hereinafter. Each entry enables a ready bit for a
particular bank according to a preassigned color, leaving the ready
for the complementary bank turned off, guaranteeing that the
instruction can only be picked to be executed in the preassigned
bank.

The fast ready mechanism is enhanced such that it communicates the
bank (color) in which a producing instruction was executed. That
information can be used to override the preassigned color of the
dependent instruction to match that of the producer so that it is
possible to avoid the extra cycle needed to support bypassing
across the execution banks. To support this, the fast ready
mechanism may use 2 sets of completion wires ("fdisp_hi" &
"fdisp_lo") along with 2 sets of 8-to-1 multiplexers (3
multiplexers per set), one set for each of the 2 colors. Each entry
enables the fast completion wire corresponding to its color (the
bank where it was executed). One of the 2 colors of ready bits is
activated in dependent entries based on which set of the 3 fast
ready multiplexers produces an active output.

An instruction can only be ready if all the instructions producing
the results it depends on were executed in the same bank, or any
dependencies that come from the other bank completed at least 1
cycle earlier. Note that the slow ready mechanism does not need to
carry this color information, because it already requires the extra
cycle latency that allows full bypassing across the execution
banks. So instructions readied by the slow ready mechanism always
take their default color. This also implies that dependencies that
span more than 8 instructions are forced to have an additional
cycle of latency to wait for full bypassing.

The pickers for the memory and branch pipes are implemented as 2
independent pick oldest of 128 structures, fed by separate sets of
"hi" and "lo" ready bits. Therefore, each entry must generate
separate hi and lo ready bits for memory instructions and branch
instructions, for a total of 4 ready bits (so far...). Due to the
timing constraints of the fast ready path, the integer pickers are
implemented as 4 independent pick 1 oldest of 64 structures. The
ISW is divided into 2 halves of even and odd entries. Even entries
are selected by 2 of the pickers, and odd entries are selected by
the other 2 pickers. Coloring is used within the 2 halves, so there
is a hi picker and a lo picker for the even and odd sides of the
window. Each entry generates a hi and a lo integer ready bit.

The method for preassigning the color of entries as they are
written into the ISW 300 is based on the physical location in the
window, where every consecutive pair of entries (e.g. 0 & 1, 2 & 3,
. . . ) is assigned alternating colors. For example, if entries 0 &
1 are assigned to the hi bank, entries 2 & 3 would be assigned to
the low bank, and so on. This scheme equally distributes
instructions across the pipelines, but has the potential downside
of having a regular cadence that could be hit by critical code
loops such that all the instructions of a given type get steered to
the same pipeline, starving the other available resources, and
yielding much worse performance than the processor 102 would
otherwise be capable of delivering.

Full bypassing is supported between all the floating
point/graphics pipelines, so coloring is unnecessary for these
pipelines. Each entry produces 1 ready bit for the FP/GR add pipes,
and 1 ready bit for the FP/GR mul pipes. These feed corresponding
pick 2 oldest out of 128 structures. As a result, the total number
of ready bits generated by each entry is 8.

There may be some cases where 2 additional cycles of latency are
required to bypass results between certain combinations of
instructions within the FP/Graphics unit. The physical reason is
that these cases are not supported by the bypassing multiplexer
hardware, so results must come from the result buffer (there is a 2
cycle latency from execution completion to writing the results into
the result buffer). The cases of concern are the so-called
"evil-twin/cross-precision" operations. The issue is that a given
instruction entry has no information about which instructions are
dependent upon its results, so it has no way of knowing that it
should delay broadcasting its completion by 2 extra cycles when a
dependent instruction causes one of these cases (plus, there may be
other dependent instructions that can use the results via the
bypass multiplexers). Therefore, no special action is taken by the
producing instruction's entry in these cases. The mechanism for
handling these cases works on the receiving end (i.e., in the
dependent instruction's window entry). Dependent instructions that
fit these cases are marked by the DCT in the IRU 204. Any so marked
instruction inserts an extra 2 cycle delay between the time that
its dependency clears and the activation of its ready bit. In the
implementation shown, it is assumed that there is just one such
flag per entry, so that if any of the 3 possible dependencies
results in one of these cases, the 2 cycle delay will be inserted
after the last dependency is cleared, whether or not that is the
dependency that really causes the problem. Alternatively, it may be
feasible to separately mark each dependency (PID) and insert the 2
cycle delay relative to the clearing of each separate one.

There is a special pipeline in the Integer datapath and two special
pipelines in the FP/GR datapaths. These special
pipelines handle long latency instructions (e.g., integer mul/div,
FP div/sqrt) and other special instructions that aren't handled by
any of the regular pipelines. Rather than dedicate full separate
datapaths (specifically register file and result buffer ports) to
handle these special instructions, they may be dispatched to the
existing integer or FP/GR pipelines, which then forward them on to
the special hardware required for their execution.

Only one special instruction may be executing at a time within the
integer datapath and two instructions in the special pipelines in
the FP/GR datapaths. A special mechanism is required to ensure that
only the oldest entry containing a special instruction turns on its
ready bit. This is accomplished using special flags, one bit for
the integer side and one for the FP/GR side, that feed picker
structures that identify the oldest such entry which generate
signals that go back to the window to tell all younger special
entries to keep their ready bit turned off.

Certain instructions must be executed in order relative to either
older or younger instructions, or both. Examples of such
instructions include those that modify special processor state
(e.g., WRPR %pstate), atomics (e.g., CAS), and membars. These
scheduling restrictions are enforced using the same kind of
mechanism as the one described above for limiting issue of special
pipe instructions. One picker-like structure monitors the
retirement status of all entries, and generates signals that cause
younger entries that have the appropriate flag set to squash their
ready bits if any older entries have not retired. A second similar
structure receives signals from all entries that indicate if
younger instructions need to wait for a particular entry to retire
before enabling their ready bits. Signals are generated to squash
the ready bits of all younger instructions relative to the oldest
instruction that has the appropriate flag set and has not yet
retired. The 2 mechanisms operate independently, so a particular
entry may invoke one or the other, or both.

The ISW 300 is physically folded, placing odd entries on one side
and even entries on the other. This reduces the height by nearly
50%, which is beneficial for timing. Furthermore, the entries are
interleaved as shown above to minimize the routing cost of having a
circular queue. If the interleaving were not done, wires spanning
the height of the window would be required for the fast ready
mechanism.

With reference now to FIG. 5, a logic flow for a thermal sensing
process 500 which may be utilized by the thermal sensing circuit
220 (FIG. 2) is shown. The process 500 monitors the temperature of
the processor 200 at decision step 502 and, if the temperature
exceeds or crosses a predetermined programmed threshold level, the
process 500 proceeds to step 504. Otherwise, the monitoring
continues at decision step 502. At step 504, a non-maskable
interrupt ("NMI") is issued to the processor 200 and the output
frequency of the clock 222 (FIG. 2) is decreased to a level of
1/64th of its nominal frequency at step 506 although any other
suitable fractional reduced rate may be chosen.

At this point, the process 500 proceeds to decision step 508 where
the temperature of the processor 200 is again monitored to
determine if the temperature has dropped back below, or crossed, a
second predetermined programmable threshold (which may or may not
be the same level as the first threshold). If the temperature has
not dropped below this second threshold level, the process 500
remains at decision step 508. Alternatively, if the temperature of
the processor 200 is below the second threshold level, the thermal
sensing circuit 220 (FIG. 2) issues another NMI to the processor
200 at step 510 and instructs the clock 222 (FIG. 2) to begin
increasing the output clock rate back to its initial nominal level.
This may be done in a stepwise incremental manner (i.e. from 1/64th
to 1/32nd to 1/16th to 1/8th to 1/4th to 1/2 of the nominal
frequency) or in any other appropriate manner.

The system and method of the present invention addresses long term
processor 200 reliability by guarding against excess thermal
conditions that could cause harm to the die and the components
integrated thereon. The thermal sensing circuit 220 and process 500
incorporates a programmable threshold which, when reached, causes
the circuit to generate a non-masked interrupt to the processor 200
which, in an exemplary embodiment, may be identical to a power down
"Energy Star" interrupt. The internal phase-locked loop ("PLL")
clock dividers of the clock 222 may be employed to step down the
master clock 222 from nominal to, for example, 1/64th of the
nominal rate. Program execution would then continue at this lowered
or reduced clock rate until the thermal sensing circuit 220 again
senses that a temperature threshold has been crossed, whereupon it
may again issue a non-masked interrupt to raise the clock frequency
back to nominal. As before, normal program execution commences at
the conclusion of the interrupt.

With reference additionally now to FIG. 6, a logic flow diagram
illustrates a process 600 for possible implementation with the
queue activity rise time detector 306 of FIG. 3. The process 600
illustrates at step 602 that the queue activity rise time detector
306 continuously monitors instruction queue activity from a queue
activity source. In the embodiment illustrated for exemplary
purposes in FIG. 3, the ISU 206 contains a queue activity rise time
detector 306 between the ISU 206 and the IRU 204 which, in this
example, is the queue activity source.

At decision step 604, the queue activity rise time detector
determines whether or not the rate of increase in queue activity
has exceeded a predetermined acceptable rate. If it has not, the
process 600 loops at decision step 604. Alternatively, if the rate
has exceeded the predetermined rate (for example, from at or near
an idle state to at or near a stall condition in between
approximately 10 to 20 processor cycles), the process 600 proceeds
to step 606 wherein a "stall" signal is issued by the queue
activity rise time detector 306 to the activity source (in the
example shown, the IRU 204). The process 600 then stalls the
activity source at decision step 608 until such time as the
activity rate drops below a predetermined rate and, at that time,
the stall signal is revoked and normal queue activity is resumed at
step 610.

As previously noted, short term transient conditions which may
cause problems with the supply voltage are designed to be detected
in every major functional unit of the processor 200 in addition to
the queue activity rise time detector 306 shown in conjunction with
the ISU 206 in FIG. 3. These "governors" detect and control the
rate of acceleration of a functional instruction unit from at or
near idle to at or near full throttle by a localized stall
mechanism at the boundary of each stage in the pipe. As such, all
queues and windows in the design can detect when they are at the
"high water mark" of queue throughput and create a stall condition
to each respective queue activity source which is responsible for
filling the queue. This mechanism can also detect when an idle
queue is suddenly overwhelmed with input such that over a short
period of approximately 10-20 machine cycles, the queue activity
rate has increased from at or near idle to a near stall threshold.
The queue activity rise time detector 306 and process 600 function
not to limit the maximum processing rate of the processor 200 but
rather to control the rise time of activity in all major functional
areas from at or near idle to at or near full speed.
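
The behavior of process 600 can be illustrated with a small software model of a queue activity rise time detector. Everything concrete in the sketch is an assumption made for illustration: activity is expressed as a fraction between 0.0 (idle) and 1.0 (stall threshold), the observation window is 16 cycles (within the approximately 10 to 20 cycles mentioned above), and the `max_rise` and `release_level` values, like the class and method names, are hypothetical.

```python
from collections import deque


class RiseTimeDetector:
    """Toy model of a queue activity rise time detector."""

    def __init__(self, window_cycles: int = 16,
                 max_rise: float = 0.8, release_level: float = 0.5):
        self.history = deque(maxlen=window_cycles)  # recent activity samples
        self.max_rise = max_rise            # largest rise tolerated in the window
        self.release_level = release_level  # activity below which stall is revoked
        self.stall = False

    def step(self, activity: float) -> bool:
        """Process one cycle's activity sample; return the stall signal."""
        if self.stall:
            # Hold the stall until activity drops below the release level.
            if activity < self.release_level:
                self.stall = False
                self.history.clear()
        elif self.history and activity - min(self.history) > self.max_rise:
            # Activity rose from near idle to near stall too quickly.
            self.stall = True
        self.history.append(activity)
        return self.stall
```

With these assumed parameters, a queue that ramps from idle to full activity over 40 cycles never triggers a stall, while a jump from idle to 95% activity within a few cycles does, which mirrors the stated intent of limiting rise time rather than peak throughput.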
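
The thermal sensing process 500 of FIG. 5 can likewise be sketched as a simple controller with two thresholds and a stepwise recovery. This is a minimal model under stated assumptions, not the circuit itself: the class and attribute names are hypothetical, the clock is represented by an integer divider applied to the nominal frequency, one recovery step is taken per call, and the choice to throttle again if the first threshold is crossed during recovery is an assumption that the description does not address.

```python
REDUCED_DIVIDER = 64  # clock reduced to 1/64th of nominal, per the description


class ThermalThrottle:
    """Toy model of process 500: throttle on overload, then step back up."""

    def __init__(self, trip_threshold: float, recover_threshold: float):
        self.trip_threshold = trip_threshold        # first programmable threshold
        self.recover_threshold = recover_threshold  # second programmable threshold
        self.divider = 1        # 1 means the nominal clock frequency
        self.ramping = False    # True while stepping back toward nominal
        self.nmi_count = 0      # number of non-maskable interrupts issued

    def step(self, temperature: float) -> int:
        """Run one monitoring iteration and return the clock divider."""
        if temperature > self.trip_threshold and self.divider != REDUCED_DIVIDER:
            # Steps 504 and 506: issue an NMI and drop to the reduced rate.
            self.nmi_count += 1
            self.divider = REDUCED_DIVIDER
            self.ramping = False
        elif self.divider == REDUCED_DIVIDER and not self.ramping:
            # Step 508: wait for the temperature to fall below the second threshold.
            if temperature < self.recover_threshold:
                # Step 510: issue another NMI and begin raising the clock.
                self.nmi_count += 1
                self.ramping = True
                self.divider //= 2
        elif self.ramping:
            # Stepwise recovery: 1/32, 1/16, 1/8, 1/4, 1/2, then nominal.
            self.divider //= 2
            if self.divider == 1:
                self.ramping = False
        return self.divider
```

Each halving of the divider doubles the clock frequency, so the recovery passes through binary multiples of the reduced frequency; program execution is assumed to continue throughout, as the description states.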
In the embodiment above described and shown, the integrated circuit
processor 200 may be conveniently implemented utilizing 0.18 um
design technology or [...] greater or lesser device line widths. As
presently contemplated, the processor 200 may utilize a clock
frequency of 1 GHz or greater with six metal layers and an
operating voltage of 1.5 volts. Metal layers 1-4 may have a minimum
uncontacted pitch of 0.5 um, while metal layers 5-6 may utilize
somewhat coarser design rules of on the order of 3 um pitch. The
processor 200 may be conveniently packaged in conventional
flip-chip packaging or other suitable configuration. Transistors in
processor 200 may be drawn at 0.16 um minimum gate length. The SRAM
cell utilized for the L2 cache and L2 tags may be 6.0 um² and the
L1 cache may be implemented with a 7.0 um² cell that [...]

All other paths may be pipelined to meet the cycle time goal.

As currently contemplated, the die size for the processor 200 may
be approximately 18 mm x 18 mm and a 512 KB L2 cache will occupy
[...] of the die area. The transistor count for the design ranges
between 40-50 million and the pipeline must be carefully
floor-planned in order to minimize routing between stages.
Relatively long conductors present a significant constraint in
minimizing cycle time. Double width and spaced metal 1-4 layer
conductors can only extend about 6 mm in a cycle after allowing for
flop and clocking signal overhead. The lower resistance metal 5-6
layers may have signals that can travel up to 17 mm in a cycle. The
number of metal 5-6 layer signal conductors will likely be limited
however, due to the fact that approximately 50% of these layers can
be occupied by power and clock routing signals.

With the foregoing specification, estimated power consumption for
the processor 200 is approximately 100 Watts at 1 GHz. With scaling
of the processor 200 to higher operating frequencies, power and
cooling systems must be designed to tolerate power consumption of
on the order of 150 Watts or more (100 Amps at 1.5 volts).

While there have been described above the principles of the present
invention in conjunction with a specific integrated circuit
processor architecture, it is to be clearly understood that the
foregoing description is made only by way of example and not as a
limitation to the scope of the invention which may be utilized in
conjunction with any high density integrated circuit intended to
operate at high clock frequencies. Particularly, it is recognized
that the teachings of the foregoing disclosure will suggest other
modifications to those persons skilled in the relevant art. Such
modifications may involve other features which are already known
per se and which may be used instead of or in addition to features
already described herein. Although claims have been formulated in
this application to particular combinations of features, it should
be understood that the scope of the disclosure herein also includes
any novel feature or any novel combination of features disclosed
either explicitly or implicitly or any generalization or
modification thereof which would be apparent to persons skilled in
the relevant art, whether or not such relates to the same invention
as presently claimed in any claim and whether or not it mitigates
any or all of the same technical problems as confronted by the
present invention. The applicants hereby reserve the right to
formulate new claims to such features and/or combinations of such
features during the prosecution of the present application or of
any further application derived therefrom.

What is claimed is:

1. [...] a temperature of a [...] predetermined programmable
thermal threshold has been exceeded;

generating a non-maskable interrupt responsive to determination
during said step of determining that the programmable thermal
threshold has been exceeded; and

reducing a frequency of a clocking signal generated by the
processor clock to said processor from a nominal frequency thereof
to a reduced frequency thereof in response to said non-maskable
interrupt.

2. The method of claim 1 further comprising the step of:

providing a thermal sensing circuit in thermal contact with said
monolithically integrated circuit processor for sensing an
operating temperature thereof.

3. The method of claim 2 wherein said step of providing [...] step
of:

integrating said thermal sensing circuit on a common substrate with
said monolithically integrated circuit processor.

4. The method of claim 1 wherein said step of reducing is [...]
frequency of said clocking signal to substantially 1/64th of the
nominal frequency.

5. The method of claim 1 further comprising the step of:

increasing said frequency of said clocking signal from said reduced
frequency thereof when a second predetermined programmable thermal
threshold is reached.

6. The method of claim 5 wherein said step of increasing is [...]
said frequency of said clocking signal [...] nominal frequency
thereof in a plurality of frequency increments [...] a binary
multiple of the reduced frequency.

* * * * *